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Abstract 

Fourteen linguistically-motivated numeri- 
cal indicators are evaluated for their abil- 
ity to categorize verbs as either states or 
events. The values for each indicator are 
computed automatically across a corpus 
of text. To improve classification perfor- 
mance, machine learning techniques are 
employed to combine multiple indicators. 
Three machine learning methods are com- 
pared for this task: decision tree induction, 
a genetic algorithm, and log-linear regres- 
sion. 

1 Introduction 

The ability to distinguish states, e.g., "Mark seems 
happy," from events, e.g., "Renee ran down the 
street, " is a necessary prerequisite for interpreting 
certain adverbial adjuncts, as well as identifying 
temporal constraints between sentences in a dis- 
course (Moens and Steedman, 1988; Dorr, 1992; Kla- 
vans, 1994). Furthermore, stativity is the first of 
three fundamental temporal distinctions that com- 
pose the aspectual class of a clause. Aspectual clas- 
sification is a necessary component for a system 
that analyzes temporal constraints, or performs lex- 
ical choice and tense selection in machine transla- 
tion (Moens and Steedman, 1988; Passonneau, 1988; 
Dorr, 1992; Klavans, 1994). 

Researchers have used empirical analysis of cor- 
pora to develop linguistically-based numerical indi- 
cators that aid in aspectual classification (Klavans 
and Chodorow, 1992; Siegel and McKeown, 1996). 
Specifically, this technique takes advantage of lin- 
guistic constraints that pertain to aspect, e.g., only 
clauses that describe an event can appear in the pro- 
gressive. Therefore, a verb that appears more fre- 
quently in the progressive is more likely to describe 
an event. 

In this paper, we evaluate fourteen quantitative 
linguistic indicators for their ability to classify verbs 
according to stativity. The values of these indicators 
are computed automatically across a corpus of text. 



Classification performance is then measured over an 
unrestricted set of verbs. Our analysis reveals a pre- 
dictive value for several indicators that have not tra- 
ditionally been linked to stativity in the linguistics 
literature. Then, in order to improve classification 
performance, we apply machine learning methods to 
combine multiple indicators. Three machine learn- 
ing techniques are compared for this task: decision 
tree induction, a genetic algorithm, and log-linear 
regression. 

In the following sections, we further detail and 
motivate the distinction between states and events. 
Next, we describe our approach, detailing the set of 
linguistic indicators, the corpus and tools used, and 
the machine learning methods. Finally, we present 
experimental results and discuss conclusions and fu- 
ture work. 

2 Stative and Event Verbs 

Stativity must be identified to detect temporal con- 
straints between clauses attached with when. For ex- 
ample, in interpreting, "She had good strength when 
objectively tested, the have-state began before or 
at the beginning of the test-event, and ended after 
or at the end of the iesi-event. However, in inter- 
preting, "Phototherapy was discontinued when the 
bilirubin came down to 13, " the discontinue-ewent 
began at the end of the come-event. As another ex- 
ample, the simple present reading of an event, e.g., 
"He jogs," denotes the habitual reading, i.e., "every 
day, " whereas the simple present reading of a state, 
e.g., "He appears healthy, " implies "at the moment. " 

Identifying stativity is the first step toward aspec- 
tually classifying a clause. Events are further distin- 
guished by two additional features: 1) telic events 
have an explicit culminating point in time, while 
non-telic events do not, and 2) extended events have 
a time duration, while atomic events do not. De- 
tecting the telicity and atomicity of a clause is neces- 
sary to identify temporal constraints between clauses 
and to interpret certain adverbial adjuncts (Moens 

^ These examples of when come from the corpus of 
medical discharge summaries used for this work. 



If a verb can occur: 


...then it must be: 


in the progressive 


Extended Event 


with a temporal adverb 
(e.g., then) 


Event 


with a duration m-PP 
(e.g., in an hour) 


Telic Event 


in the perfect tense 


Telic Event or State 



Table 1: Example linguistic constraints excerpted 
from Klavans (1994). 



and Steedman, 1988; Passonneau, 1988; Dorr, 1992; 
Klavans, 1994). However, since these features apply 
only to events and not to states, a clause first must 
be classified according to stativity. 

Certain features of a clause, such as adjuncts and 
tense, are constrained by and contribute to the as- 
pectual class of the clause (Vendler, 1967; Dowty, 
1979; Pustejovsky, 1991; Passonneau, 1988; Klavans, 
1994). Examples of such constraints are listed in 
Table |^. Each entry in this table describes a syn- 
tactic aspectual marker and the constraints on the 
aspectual class of any clause that appears with that 
marker. For example, a telic event can be modified 
by a duration m-PP, as in "You found us there in 
ten minutes, " but a state cannot, e.g., "*You loved 
him in ten minutes." 

In general, the presence of these linguistic markers 
in a particular clause indicates a constraint on the 
aspectual class of the clause, but the absence thereof 
does not place any constraint. This makes it difficult 
for a system to aspectually classify a clause based 
on the presence or absence of a marker. Therefore, 
these linguistic constraints are best exploited by a 
system that measures their frequencies across verbs. 

Klavans and Chodorow (1992) pioneered the ap- 
plication of statistical corpus analysis to aspectual 
classification by placing verbs on a "stativity scale" 
according to the frequency with which they occur 
in the progressive. This way, verbs are automati- 
cally ranked according to their propensity towards 
stativity. We have previously applied this princi- 
ple towards distinguishing telic events from non-telic 
events (Siegel and McKeown, 1996). Classification 
performance was increased by combining multiple 
aspectual markers with a genetic algorithm. 

3 Approach 

Our goal is to exploit linguistic constraints such as 
those listed in Table ^ by counting their frequencies 
in a corpus. For example, it is likely that event verbs 
will occur more frequently in the progressive than 
state verbs, since the progressive is constrained to 
occur with event verbs. Therefore, the frequency 
with which a verb occurs in the progressive indicates 
whether it is an event or stative verb. 



We have evaluated 14 such linguistic indicators 
over clauses selected uniformly from a text corpus. 
In this way, we are measuring classification perfor- 
mance over an unrestricted set of verbs. First, the 
ability for each indicator to individually distinguish 
between stative and event verbs is evaluated. Then, 
in order in increase classification performance, ma- 
chine learning techniques are employed to combine 
multiple indicators. 

In this section, we first describe the set of lin- 
guistic indicators used to discriminate events and 
states. Then, we show how machine learning is used 
to combine multiple indicators to improve classifica- 
tion performance. Three learning methods are com- 
pared for this task. Finally, we describe the corpus 
and evaluation set used for these experiments. 

3.1 Linguistic Indicators 

The first column of Table || lists the 14 linguistic in- 
dicators evaluated in this paper for classifying verbs. 
The second and third columns show the average 
value for each indicator over stative and event verbs, 
respectively, as computed over a c orpu s of parsed 



clauses, described below in Section 3.3. These val- 
ues, as well as the third column, are further detailed 
in Section IJ. 

Each verb has a unique value for each indicator. 
The first indicator, frequency, is simply the the fre- 
quency with which each verb occurs. As shown in 
Table ^ stative verbs occur more frequently than 
event verbs in our corpus. 

The remaining 13 indicators measure how fre- 
quently each verb occurs in a clause with the lin- 
guistic marker indicated. This list includes the four 
markers listed in Table 0, as well as 9 additional 
markers that have not previously been linked to sta- 
tivity. For example, the next three indicators listed 
m Table I measure the frequency with which verbs 
1) are modified by not or never, 2) are modified by 
a temporal adverb such as then or frequently, and 
3) have no deep subject (passivized phrases often 
have no deep subject, e.g., "She was admitted to the 
hospital"). As shown, stative verbs are modified by 
not or never more frequently than event verbs, but 
event verbs are modified by temporal adverbs more 
frequently than stative verbs. For further detail re- 
garding the set of 14 indicators, see Siegel (1997). 

An individual indicator can be used to classify 
verbs by simply establishing a threshold; if a verb's 
indicator value is below the threshold, it is assigned 
one class, otherwise it is assigned the alternative 
class. For example, in Table which shows the 
predominant class and four indicator values corre- 
sponding to each of four verbs, a threshold of 1.00% 
would allow events to be distinguished from states 
based on the values of the not/never indicator. The 
next subsection describes how all 14 indicators can 
be used together to classify verbs. 









"not" or 


temporal 


no deep 


Verb 


class 


f req 


"never" 


adverb 


subject 


show 


state 


2,131 


1.55% 


0.52% 


18.07% 


admit 


event 


1,895 


0.05% 


1.11% 


91.13% 


discharge 


event 


1,608 


0.50% 


1.87% 


96.64% 


feel 


state 


1,177 


4.61% 


1.20% 


52.52% 



Table 3: Example verbs and their indicator values. 



Linguistic 


Stative 


Event 


T-test 


Indicator 


Mean 


Mean 


P-value 


frequency 


932.89 


667.57 


0.0000 


"not" or "never" 


4.44% 


1.56% 


0.0000 


temporal adverb 


1.00% 


2.70% 


0.0000 


no deep subject 


36.05% 


57.56% 


0.0000 


past/pres participle 


20.98% 


15.37% 


0.0005 


duration m-PP 


0.16% 


0.60% 


0.0018 


perfect 


2.27% 


3.44% 


0.0054 


present tense 


11.19% 


8.94% 


0.0901 


progressive 


1.79% 


2.69% 


0.0903 


manner adverb 


0.00% 


0.03% 


0.1681 


evaluation adverb 


0.69% 


1.19% 


0.1766 


past tense 


62.85% 


65.69% 


0.2314 


duration /or-PP 


0.59% 


0.61% 


0.8402 


continuous adverb 


0.04% 


0.03% 


0.8438 



Table 2: 
classes. 



3.2 



Indicators discriminate between two 



Combining Indicators with Learning 

Given a verb and its 14 indicator values, our goal 
is to use all 14 values in combination to classify the 
verb as a state or an event. Once a function for com- 
bining indicator values has been established, previ- 
ously unobserved verbs can be automatically classi- 
fied according to their indicator values. This section 
describes three machine learning methods employed 
to this end. 

Log-linear regression. As suggested by Klavans 
and Chodorow (1992), a weighted sum of multi- 
ple indicators that results in one "overall" indica- 
tor may provide an increase in classification perfor- 
mance. This method embodies the intuition that 
each indicator correlates with the probability that 
a verb describes an event or state, but that each 
indicator has its own unique scale, and so must be 
weighted accordingly. One way to determine these 
weights is log-linear regression (Santner and Duffy, 
1989), a popular technique for binary classification. 
This technique, which is more extensive than a sim- 
ple weighted sum, applies an inverse logit function, 
and employs the iterative reweighted least squares 
algorithm (Baker and Nelder, 1989). 
Genetic programming. An alternative to avoid 
the limitations of a linear combination is to gener- 



ate a non-linear function tree that combines multiple 
indicators. A popular method for generating such 
function trees is a genetic algorithm (Holland, 1975; 
Goldberg, 1989). The use of genetic algorithms to 
generate function trees (Cramer, 1985; Koza, 1992) 
is frequently called genetic programming. The func- 
tion trees are generated from a set of 17 primi- 
tives: the binary functions ADD, MULTIPLY and 
DIVIDE, and 14 terminals corresponding to the 14 
indicators listed in Table |. This set of primitives 
was established empirically; conditional functions, 
subtraction, and random constants failed to change 
performance significantly. The polarities for several 
indicators were reversed according to the polarities 
of the weights established by log-linear regression. 
Because the genetic algorithm is stochastic, each run 
may produce a different function tree. Runs of the 
genetic algorithm have a population size of 500, and 
end after 50,000 new individuals have been evalu- 
ated. 

A threshold must be selected for both linear and 
function tree combinations of indicators. This way, 
overall outputs can be discriminated such that classi- 
fication performance is maximized. For both meth- 
ods, this threshold is established over the training 
set and frozen for evaluation over the test set. 
Decision trees. Another method capable of mod- 
cling non-linear relationships between indicators is a 
decision tree. Each internal node of a decision tree is 
a choice point, dividing an individual indicator into 
ranges of possible values. Each leaf node is labeled 
with a classification (state or event). Given the set of 
indicator values corresponding to a verb, that verb's 
class is established by deterministically traversing 
the tree from the root to a leaf. The most popular 
method of decision tree induction, employed here, 
is recursive partitioning (Quinlan, 1986; Breiman et 
al., 1984), which expands the tree from top to bot- 
tom. The Splus statistical package was used for the 
induction process, with parameters set to their de- 
fault values. 

Previous efforts in corpus-based natural lan- 
guage processing have incorporated machine learn- 
ing methods to coordinate multiple linguistic indica- 
tors, e.g., to classify adjectives according to marked- 
ness (Hatzivassiloglou and McKeown, 1995), to per- 
form accent restoration (Yarowsky, 1994), for dis- 
ambiguation problems (Yarowsky, 1994; Luk, 1995), 





n 


States 


Events 


be 

have 

all other verbs 


23,409 
7,882 
66,682 


100.0% 
69.9% 
16.2% 


0.0% 
30.1% 
83.8% 



Table 4: Breakdown of verb occurrences. 



and for the automatic identification of semantically 
related groups of words (Pereira, Tishby, and Lee, 
1993; Hatzivassiloglou and McKeown, 1993). For 
more detail on the machine learning experiments de- 
scribed here, see Siegel (1997). 

3.3 A Parsed Corpus 

The automatic identification of individual con- 
stituents within a clause is necessary to compute the 
values of the linguistic indicators in Table ||. The 
Enghsh Slot Grammar (ESG) (McCord, 1990) has 
previously been used on corpora to accumulate as- 
pectual data (Klavans and Chodorow, 1992). ESG 
is particularly attractive for this task since its out- 
put describes a clause's deep roles, detecting, for ex- 
ample, the deep subject and object of a passivized 
phrase. 

Our experiments are performed across a 1,159,891 
word corpus of medical discharge summaries from 
which 97,973 clauses were parsed fully by ESG, with 
no self-diagnostic errors (ESG produced error mes- 
sages on some of this corpus' complex sentences). 
The values of each indicator in Table ^ are com- 
puted, for each verb, across these 97,973 clauses. 

In this paper, we evaluate our approach over verbs 
other than be and have, the two most frequent verbs 
in this corpus. Table ^ shows the distribution of 
clauses with be, have, and remaining verbs as their 
main verb. Clauses with be as their main verb al- 
ways denote states. Have is highly ambiguous, so 
the aspectual classification of clauses headed by have 
must incorporate additional constituents. For ex- 
ample, "The patient had Medicaid" denotes a state, 
while, "The patient had an enema" denotes an event. 
In separate work, we have shown that the semantic 
category of the direct object of have informs classi- 
fication according to stativity (Siegel, 1997). Since 
the remaining problem is to increase the classifica- 
tion accuracy over the 68.1% of clauses that have 
main verbs other than be and have, all results are 
measured only across that portion of the corpus. As 
shown in Table ^, 83.8% of clauses with verbs other 
than be and have are events. 

A portion of the parsed clauses must be manu- 
ally classified to provide supervised training data for 
the three learning methods mentioned above, and to 
provide a separate set of test data with which to eval- 
uate the classification performance of our system. To 
this end, we manually marked 1,851 clauses selected 
uniformly from the set of parsed clauses not headed 



by be or have. As a linguistic test to mark according 
to stativity, each clause was [tested for readability 
with "What happened was... 'a Of these, 373 were 
rejected because of parsing problems (verb or direct 
object incorrectly identified). This left 1,478 parsed 
clauses, which were divided equally into 739 training 
and 739 testing cases. 

Some verbs can denote both states and events, 
depending on other constituents of the clause. For 
example, .show denotes a state in "His lumbar punc- 
ture showed evidence of white cells, " but denotes an 
event in "He showed me the photographs. " However, 
in this corpus, most verbs other than have are highly 
dominated by one sense. Of the 739 clauses included 
in the training set, 235 verbs occurred. Only 11 of 
these verbs were observed as both states and events. 
Among these, there was a strong tendency towards 
one sense. For example, show appears primarily as 
a state. Only five verbs - say, state, supplement, 
describe, and lie, were not dominated by one class 
over 80% of the time. Further, each of these were 
observed less than 6 times a piece, which makes the 
estimation of sense dominance inaccurate. 

The limited presence of verbal ambiguity in the 
test set does, however, place an upper bound of 
97.4% on classification accuracy, since linguistic in- 
dicators are computed over the main verb only. 

4 Results 

Since we are evaluating our approach over verbs 
other than be and have, the test set is only 16.2% 
states, as shown in Table ^. Therefore, simply clas- 
sifying every verb as an event achieves an accuracy 
of 83.8% over the 739 test cases, since 619 are events. 
However, this approach classifies all stative clauses 
incorrectly, achieving a stative recall of 0.0%. This 
method serves as a baseline for comparison since 
we are attempting to improve over an uninformed 
approach. cl 

4.1 Individual Indicators 

The second and third columns of Table || show the 
average value for each indicator over stative and 
event clauses, as measured over the 739 training ex- 
amples. As described above, these examples exclude 
be and have. For example, 4.44% of stative clauses 
are modified by either not or never, but only 1.56% 
of event clauses were modified by these adverbs. The 
fourth column shows the results of T-tests that com- 
pare the indicator values over stative verbs to those 
over event verbs. For example, there is less than 
a 0.05% chance that the difference between stative 
and event means for the first four indicators listed 

^This test was suggested by Judith Klavans (personal 
communication) . 

Similar baselines for comparison have been used for 
many classification problems (Duda and Hart, 1973), 
e.g., part-of-speech tagging (Church, 1988; Allen, 1995). 



is due to chance. Overall, this shows that the differ- 
ences in stative and event averages are statistically 
significant for the first seven indicators listed (p < 

This analysis has revealed correlations between 
verb class and five indicators that have not been 
linked to stativity in the linguistics literature. Of the 
top seven indicators shown to have positive correla- 
tions with stativity, three have been linguistically 
motivated, as shown in Table The other four 
were not previously hypothesized to correlate with 
aspectual class: (1) verb frequency, (2) occurrences 
modified by "not" or "never", (3) occurrences with 
no deep subject, and (4) occurrences in the past or 
present participle. Furthermore, the last of these 
seven, occurrences in the perfect tense, was not pre- 
viously hypothesized to correlate with stativity in 
particular. 

However, a positive correlation between indicator 
value and verb class does not necessarily mean an 
indicator can be used to increase classification ac- 
curacy. Each indicator was tested individually for 
its ability to improve classification accuracy over the 
baseline by selecting the best classification threshold 
over the training data. Only two indicators, verb 
frequency, and occurrences with not and never, 
were able to improve classification accuracy over 
that obtained by classifying all clauses as events. 
To validate that this improved accuracy, the thresh- 
olds established over the training set were used over 
the test set, with resulting accuracies of 88.0% and 
84.0%, respectively. Binomial tests showed the first 
of these to be a significant improvement over the 
baseline of 83.8%, but not the second. 

4.2 Combining Indicators 

All three machine learning methods successfully 
combined indicator values, improving classification 
accuracy over the baseline measure. As shown in Ta- 
ble ||, the decision tree's accuracy was 93.9%, genetic 
programming's function trees had an average accu- 
racy of 91.2% over seven runs, and the log-linear re- 
gression achieved an 86.7% accuracy. Binomial tests 
showed that both the decision tree and genetic pro- 
gramming achieved a significant improvement over 
the 88.0% accuracy achieved by the frequency indi- 
cator alone. Therefore, we have shown that machine 
learning methods can successfully combine multi- 
ple numerical indicators to improve the accuracy by 
which verbs are classified. 

The differences in accuracy between the three 
methods are each significant (p < .01). Therefore, 
these results highlight the importance of how linear 
and non-linear interactions between numerical lin- 
guistic indicators are modeled. 

4.3 Improved Recall Tradeoff 

The increase in the number of stative clauses cor- 
rectly classified, i.e. stative recall, illustrates a more 



dramatic improvement over the baseline. As shown 
in Table |, stative recalls of 74.2%, 47.4% and 34.2% 
were achieved by the three learning methods, as 
compared to the 0.0% stative recall achieved by the 
baseline, while only a small loss in recall over event 
clauses was suffered. The baseline does not classify 
any stative clauses correctly because it classifies all 
clauses as events. This difference in recall is more 
dramatic than the accuracy improvement because of 
the dominance of event clauses in the test set. 

This favorable tradeoff between recall values 
presents an advantage for applications that weigh 
the identification of stative clauses more heavily 
than that of event clauses. For example, a preposi- 
tional phrase denoting a duration with for, e.g., "for 
a minute," describes the duration of a state, e.g., 
"She felt sick for two weeks, " or the duration of the 
state that results from a telle event, e.g., "She left the 
room for a minute. " That is, correctly identifying 
the use of for depends on identifying the stativity 
of the clause it modifies. A language understand- 
ing system that incorrectly classifies "She felt sick 
for two weeks" as a non-telic event will not detect 
that "for two weeks" describes the duration of the 
/ee/-state. If this system, for example, summarizes 
durations, it is important to correctly identify states. 
In this case, our approach is advantageous. 

5 Conclusions and Future Work 

We have compiled a set of fourteen c^uantitativc lin- 
guistic indicators that, when used together, signifi- 
cantly improve the classification of verbs according 
to stativity. The values of these indicators are mea- 
sured automatically across a corpus of text. 

Each of three machine learning techniques success- 
fully combined the indicators to improve classifica- 
tion performance. The best of the three, decision 
tree induction, achieved a classification accuracy of 
93.9%, as compared to the uninformed baseline's ac- 
curacy of 83.8%. Furthermore, genetic programming 
and log-linear regression also achieved improvements 
over the baseline. These results were measured over 
an unrestricted set of verbs. 

The improvement in classification performance is 
more dramatically illustrated by the favorable trade- 
off between stative and event recall achieved by all 
three of these methods, which is profitable for tasks 
that weigh the identification of states more heavily 
than events. 

This analysis has revealed correlations between 
stativity and five indicators that are not tradition- 
ally linked to stativity in the linguistic literature. 
Furthermore, one of these four, verb frequency, in- 
dividually increased classification accuracy from the 
baseline method to 88.0%. 

To classify a clause, the current system uses only 
the indicator values corresponding to the clause's 
main verb. This procedure could be expanded to 





overall 

accuracy 


States 


Events 


recall precision 


recall precision 


decision tree 


93.9% 


74.2% 86.4% 


97.7% 95.1% 


genetic programming 


91.2% 


47.4% 97.3% 


99.7% 90.7% 


log-linear 


86.7% 


34.2% 68.3% 


96.9% 88.4% 


baseline 


83.8% 


0.0% 100.0% 


100.0% 83.8% 



Table 5: Comparison of three learning methods and a performance baseline. 



incorporate rules that classify a clause directly from 
clausal features (e.g., Is the main verb show, is the 
clause in the progressive?), or by calculating indi- 
cator values over other clausal constituents in addi- 
tion to the verb (Siegel and McKeown, 1996; Siegel, 
1997). 

Classification performance may also improve by 
incorporating additional linguistic indicators, such 
as co-occurrence with rate adverbs, e.g., quickly, or 
occurrences as a complement of force or persuade, as 
suggested by Klavans and Chodorow (1992). 
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